/*****************************************************************
* Do File: calibration_targets.do
*
* Purpose:
*   This file uses clean_census_tract.dta as input. It first classifies
*   each census tract into one of three neighborhoods (A, B, or C) based 
*   on the share of rich families (defined based on the metro income cutoff) 
*   residing in each tract. It then computes statistics (moments) to 
*   calibrate the model.
*
* Statistics (Labels Only):
*   "Neighborhood A Size"
*   "Neighborhood B Size"
*   "Neighborhood C Size"

*   "Share of A's population that are rich"
*   "Share of B's population that are rich"

*   "Share of A's population with college"
*   "Share of B's population with college"

*   "Rent Ration A/B"
*   "Rent Ratio B/C"

*   "Dissim Index, Neighborhood (Top 20% Rich)"
*	"Gini Index"
* 	"Income Ratio p25/p75"
*****************************************************************/





****************************************************************************** 
********** Data on the population
******************************************************************************

use "$output/clean_census_tract.dta", clear
keep if ind_children==1
collapse (sum) population = f_count_ct_kids, by(year metro)
tempfile new_popweight
save `new_popweight', replace  // this is the populaion weight according to family count with kids 

****************************************************************************** 
********** Define neighborhoods by calculating the share of top and bottom 20% population
******************************************************************************
**Section: This part defines neighborhoods by analyzing the income distribution.
**It uses the family count in different income brackets to compute cumulative shares (CDF),
**identifies the brackets closest to specified percentiles (80% and 20%), and assigns flags accordingly.


use "$output/clean_census_tract.dta", clear
keep if ind_children==1
keep year metro countyfips fips bracket_no f_count_ct_kids
drop if bracket_no == .  


** Families in the top 20% of income distribution at MSA are rich others are poor

* Calculate the cumulative share to identify percentiles
collapse (sum) f_count_ct, by(year metro bracket_no)
bysort year metro (bracket_no): egen total = sum(f_count_ct)
gen share = f_count_ct / total
**Collapses data by summing family counts (f_count_ct) for each income bracket (bracket_no) within every year and metro.
**Then calculates the total number of families and computes the share of families in each bracket.

* Find the income bucket that is closest to the 80th percentile
bysort year metro (bracket_no) :  gen cdf = sum(share)
gen mydiff = cdf - 0.8
gen mydiff_b20 = cdf - 0.2
**For each income bracket, the cumulative distribution function (cdf) is computed.
**The differences (mydiff and mydiff_b20) measure how far each cdf is from the targeted percentiles (80% and 20%).

** Assign the two income brackets closest to 0.8 of cdf flag of -1 and 1
bys year metro: gen term1 = abs(mydiff)+(mydiff>0)
bys year metro: gen term2 = abs(mydiff)
bys year metro: egen term3 = min(term1)
bys year metro: gen term4 = term3 == term2
**Calculate auxiliary variables (term1, term2, term3, term4) that help identify the two brackets 
**that are closest to the 80th percentile. The flag values (-1 and 1) will later distinguish between the lower and higher brackets.

** Assign the two income brackets closest to 0.2 of cdf flag of -1 and 1
bys year metro: gen term1_b20 = abs(mydiff_b20)+(mydiff_b20>0)
bys year metro: gen term2_b20 = abs(mydiff_b20)
bys year metro: egen term3_b20 = min(term1_b20)
bys year metro: gen term4_b20 = term3_b20 == term2_b20

**A similar procedure is repeated for the 20th percentile (bottom 20%) to identify the income brackets close to that cutoff.

sort metro year bracket_no
bys year metro (bracket_no): gen flag = term4[_n-1]
bys year metro (bracket_no): gen flag2 = -1*flag[_n+1]
bys year metro (bracket_no): gen flag_b20 = term4_b20[_n-1]
bys year metro (bracket_no): gen flag2_b20 = -1*flag_b20[_n+1]
**These commands shift the flag values by one observation (using [_n-1] and [_n+1]) so that adjacent income brackets
**are marked. The use of negative and positive flags will help later in assigning weight proportions.

replace flag = flag+flag2 // this gives the required flag
replace flag_b20 = flag_b20+flag2_b20 
**Combining flag values (flag and flag2) results in a single flag indicating the relative position (low or high)
**for the income bracket with respect to the targeted percentile.

**** For each Metro keep only the two income brackets close to 0.8 cdf
replace flag = flag*cdf
gen type = ""
replace type = "Low" if flag < 0
replace type = "High" if flag >= 0
**The flag is multiplied by the cumulative share (cdf) to weigh the income bracket.

**** For each Metro keep only the two income brackets close to 0.2 cdf
replace flag_b20 = flag_b20*cdf
gen type_b20 = ""
replace type_b20 = "Low" if flag_b20 < 0
replace type_b20 = "High" if flag_b20 >= 0

keep if flag_b20 != 0  | flag != 0
drop if flag_b20 == . & flag == . 

** This is some technique of assigning weights to the two income brackets close to 0.8 
// Note: This is a process of finding w such that w*cdf_L + (1-w)*cdf_H = 0.8
// That is we are trying to assign weights to the two income brackets such that 80% is poor and rest is rich in Metro


bys year metro: egen abflag = sum(abs(flag))  // This is taking the sum: cdf_L + cdf_H
bys year metro: egen suflag = sum(flag) // This is taking difference: cdf_H - cdf_L
replace abflag = . if flag == 0
replace suflag = . if flag == 0

bys year metro: egen abflag_b20 = sum(abs(flag_b20))  // This is taking the sum: cdf_L + cdf_H for bottom 20%
bys year metro: egen suflag_b20 = sum(flag_b20) // This is taking difference: cdf_H - cdf_L for bottom 20%
replace abflag_b20 = . if flag_b20 == 0
replace suflag_b20 = . if flag_b20 == 0

***top 20%
gen w_low = (abflag-0.8-cdf)/suflag
gen w_high = (0.8+cdf-abflag)/suflag
gen weight = .
replace weight = w_low if w_low > 0 
replace weight = w_high if w_high > 0
replace weight = 0 if weight < 0
replace weight = 1 if weight > 1
replace weight = . if w_high ==.
**These lines calculate the weights for the top 20% (rich) income brackets.
**They determine the contribution of each bracket (low and high) using the computed differences,
**and then ensure that weights are within sensible bounds (0 to 1).

***bottom 20%
gen w_low_b20 = (abflag_b20-0.2-cdf)/suflag_b20
gen w_high_b20 = (0.2+cdf-abflag_b20)/suflag_b20
gen weight_b20 = .
replace weight_b20 = w_low_b20 if w_low_b20 > 0 
replace weight_b20 = w_high_b20 if w_high_b20 > 0
replace weight_b20 = 0 if weight_b20 < 0
replace weight_b20 = 1 if weight_b20 > 1
replace weight_b20 = . if w_high_b20 ==.
**Similarly, weights for the bottom 20% (poor) brackets are computed and adjusted.

keep year metro bracket_no weight weight_b20

tempfile weight
save `weight', replace

**************************************************

use "$output/clean_census_tract.dta", clear
keep if ind_children==1
keep year metro countyfips fips bracket_no f_count_ct_kids f_count_ct_wokids low high income 
merge m:1 year metro bracket_no using `weight'
drop _merge
sort year metro fips bracket_no 
** At this point we have weights assigned to two income brackets close to 0.8 for each MSA 

** Based on the MSA income distribution, we want the share of rich and poor families in CT
** Get the CDF at census tract level 
** Then share_poor = w*cdf_L + (1-w)*cdf_H where L and H are the two income brackets close to 0.8

bys year metro fips: egen f_count_ct = sum(f_count_ct_kids)  // f_count_ct is the count of families in the ct
gen share = f_count_ct_kids/f_count_ct
bys year metro fips: gen cdf = sum(share)
bys year metro fips: egen ff = sum(f_count_ct_kids)
*top 20%
gen cutoff = cdf*weight
bys year metro fips: egen cut = sum(cutoff) // cut is share of poor in each census tract
**The “cutoff” is calculated as the product of the cumulative share and the computed weight.
**Then, by summing these cutoff values within each census tract, the code determines the share of poor families.

keep year metro fips bracket_no cut ff weight_b20 cdf
rename cut cutoff
rename ff f_count_ct
gen rich_share = 1-cutoff
**The variable “rich_share” is defined as the complement of the “cutoff” (i.e. 1 minus the poor share).
**This provides the share of rich families in the tract.

*bottom 20%
sort year metro fips bracket_no
gen cut_b20 = cdf*weight_b20
bys year metro fips: egen cutoff_b20 = sum(cut_b20) // cut is share of poor in each census tract
**The same procedure is repeated for the bottom 20% weighting – calculating a cutoff based on weight_b20.

keep year metro fips cutoff cutoff_b20 rich_share f_count_ct
gen rich_share_b20 = 1-cutoff_b20
duplicates drop    

gen X4 = .
replace X4 = 1 if rich_share >= 0.30  // we select the cutoff to get a population distribution.
replace X4 = 2 if rich_share >= 0.17 & rich_share < 0.30
replace X4 = 3 if rich_share < 0.17
**A new categorical variable “X4” is created based on the share of rich families.
**This variable is used later to denote neighborhood types (for example, A, B, C categories).

tempfile nbr_old
save `nbr_old', replace

******************************************************************************
********** Dissimilarity index TOP 20% RICH (NBD: A, B, C)
******************************************************************************
**Section: The following block computes the dissimilarity index (a measure of segregation)
**for the top 20% rich families aggregated at the neighborhood level (with categories A, B, and C).

use `nbr_old', clear
keep if year==1980
gen poor = cutoff*f_count_ct
gen rich = rich_share*f_count_ct // f_count_ct = poor + rich
**Here the number of poor and rich families per tract are calculated using the cutoff and rich_share.

collapse (sum) poor rich f_count_ct, by(year metro X4)
bys year metro: egen poor_metro = sum(f_count_ct*0.8)
bys year metro: egen rich_metro = sum(f_count_ct*0.2) 
bys year metro: egen f_count_ct_kids = sum(f_count_ct)
replace poor = poor/poor_metro
replace rich = rich/rich_metro 
**The totals for poor and rich families are computed at the metro level.
**Then each tract’s poor and rich counts are normalized by the metro-level totals.

gen mydiff = abs(rich - poor)/2
rename mydiff dissim
rename f_count_ct pop
collapse (sum) dissim pop, by(year metro)
collapse (mean) dissim [aweight = pop], by(year)
**The dissimilarity index is calculated as half the absolute difference between the normalized rich and poor proportions,
**aggregated first by metro and then averaged across metros (weighted by population).

rename dissim Dis_T20_nbd
label variable Dis_T20_nbd Dis_T20_nbd

tempfile Dis_T20_nbd
save `Dis_T20_nbd'

******************************************************************************
********** Rental Ratios 
******************************************************************************
**Section: This block calculates rental ratios by merging housing-related variables from the cleaned dataset
**with the neighborhood assignments (nbr_old). It computes average rent and mortgage costs to form ratios.





use "$output/clean_census_tract.dta", clear 
keep if ind_housing==1
keep  year metro countyfips fips *_rent *_mortgage *_units 
collapse (mean) *_rent *_mortgage *_units, by(year metro countyfips fips)

merge 1:1 year metro fips using `nbr_old'
keep if _merge == 3
gen avg_rent = agg_rent/cash_units
replace median_rent = avg_rent if median_rent == 0 
**After merging with neighborhood data, the code calculates average rent as the ratio of aggregated rent to cash units,
**and uses it to replace any zero median rent values.

drop if median_rent == .
keep if median_rent > 0
drop if median_costs_mortgage == .
keep if median_costs_mortgage > 0
**Ensure that only valid (non-missing and positive) rental and mortgage cost observations remain.

bys year metro X4: egen tot_rent = total(median_rent*cash_units)
bys year metro X4: egen tot_own = total(median_costs_mortgage*counts_mortgage)
bys year metro X4: egen units_rent = total(cash_units)
bys year metro X4: egen units_own = total(counts_mortgage)
bys year metro X4: egen population = total(f_count_ct)
**Total values (for rent, ownership, and units) are computed per neighborhood (X4 grouping) within each metro and year.

keep year metro X4 tot_rent tot_own units_rent units_own population
duplicates drop
**Unneeded variables are dropped and duplicates are removed.

gen avg_costs = (tot_rent+tot_own)/(units_rent+units_own)
gen units = units_rent + units_own
**The overall average cost (combining rent and mortgage) is computed by dividing the total costs by total units,
**and a combined count of housing units is also generated.

preserve
collapse (mean) avg_costs [aweight=units], by(year metro X4)  // avg_cost is same as above
tempfile int
save `int'
restore

collapse (sum) population, by(year metro X4)
collapse (sum) population, by(year metro)
merge 1:m year metro using `int'
bys year metro: gen obs = _N
keep if obs == 3
sort year metro X4
bys year metro (X4): gen HouseBC = avg_costs[2]/avg_costs[3]
bys year metro (X4): gen HouseAB = avg_costs[1]/avg_costs[2]
keep year metro population HouseAB HouseBC
duplicates drop
**Two ratios are produced: HouseAB compares groups A and B, and HouseBC compares groups B and C.

collapse (mean) HouseAB HouseBC [aweight=population], by(year)
tempfile rental_ratio
save `rental_ratio'

 
******************************************************************************
********** College shares
******************************************************************************
**Section: This section computes the percentage of college graduates in each neighborhood (A, B, C).
**It merges the college counts from the census data with the neighborhood file.

** Calculate the percentage of college graduates in neighborhoods A, B, and C
use "$output/clean_census_tract.dta", clear
keep if year==1980
keep if ind_children==1
drop if f_count_ct_kids == . 
keep year metro fips college total
rename college allcollege
rename total population
duplicates drop
merge 1:1 year metro fips using `nbr_old'
keep if _merge == 3
drop _merge
collapse (sum) population allcollege, by(year metro X4)
gen college = allcollege/population
collapse (mean) college [aweight=pop], by(year X4)
reshape wide college, i(year) j(X4)
rename college1 College_A
rename college2 College_B
rename college3 College_C
label variable College_A College_A
label variable College_B College_B
label variable College_C College_C
**The share of college graduates is calculated for each neighborhood by dividing the total number of college graduates
**by the total population, and then averaging these shares using population weights.
drop College_C

tempfile college
save `college'
 
******************************************************************************
********** Share of rich people in each neighborhood (as fraction of total neighborhod population)
******************************************************************************
**Section: This block computes the share of rich families in each neighborhood (by fraction of total tract population)
**and then averages these shares across neighborhoods using population weights.

use "$output/clean_census_tract.dta", clear
keep if year==1980
keep if ind_children==1
drop if f_count_ct_kids == .
keep year metro fips 
duplicates drop
merge 1:1 year metro fips using `nbr_old'
keep if _merge == 3
drop _merge
gen rich = rich_share*f_count_ct
collapse (sum) rich f_count_ct, by(year metro X4)
bys year metro X4: egen rich_total = sum(rich)  // rich_total is same as rich 
gen rich_share = rich_total/f_count_ct
collapse (mean) rich_share, by(year metro X4)
merge m:1 year metro using `new_popweight'
keep if _merge == 3
collapse (mean) rich_share [aweight=population], by(year X4)
reshape wide rich_share, i(year) j(X4)
rename rich_share1 ShareOfRich_A
rename rich_share2 ShareOfRich_B
rename rich_share3 ShareOfRich_C
label variable ShareOfRich_A ShareOfRich_A
label variable ShareOfRich_B ShareOfRich_B
label variable ShareOfRich_C ShareOfRich_C
**The share of rich residents is recalculated by aggregating within neighborhood types.
**The resulting variables (ShareOfRich_A, _B, and _C) indicate the fraction of rich people in each neighborhood.
drop ShareOfRich_C 

tempfile rich_share
save `rich_share'

 
******************************************************************************
********** Neighborhood Sizes
******************************************************************************
**Section: This block computes the relative size of each neighborhood (A, B, C) by evaluating the 
**share of the total population residing in each category.

** New NBR sizes
use "$output/clean_census_tract.dta", clear
keep if ind_children==1
drop if f_count_ct_kids == .
keep year metro fips
duplicates drop
merge 1:1 year metro fips using `nbr_old'
keep if _merge == 3
drop _merge
collapse (sum) f_count_ct, by(year metro X4)
by year metro: egen pop = sum(f_count_ct)
gen share = f_count_ct/pop
keep year metro X4 share pop
merge m:1 year metro using `new_popweight'
keep if _merge == 3
collapse (mean) share [aweight=population], by(year X4)
reshape wide share, i(year) j(X4)
rename share1 Size_A
rename share2 Size_B
rename share3 Size_C
label variable Size_A Size_A
label variable Size_B Size_B
label variable Size_C Size_C
**The “Size” of each neighborhood is defined as the share of total population in that category.
**These sizes are later used as weights for merging and computing overall averages.

tempfile size
save `size'
 
 
************
** Gini 
*********** 
use "$output/metro_gini_all", clear
keep if year==1980
keep year famkids_weighted 
duplicates drop year famkids_weighted, force 
rename famkids_weighted gini
tempfile gini 
save `gini', replace  




************
** Calcualte the income percentile ratio 
***************
use "$output/clean_census_tract.dta", clear
keep if year==1980 
keep if ind_children==1
collapse (p25) p25_inc = mean_kids_income (p75) p75_inc = mean_kids_income [aw=f_count_ct_kids], by(year)
gen p25_p75_ratio = p25_inc/p75_inc
keep year p25_p75_ratio
tempfile p25p75_ratio 
save `p25p75_ratio', replace  











 
 
******************************************************************************
********** Merging all the statistics together
******************************************************************************
**Final Section: All computed statistics (neighborhood sizes, distribution of rich, college shares,
**income ratios, rental ratios, dissimilarity indices, etc.) are merged into a single dataset for export.

merge 1:1 year using `size'
drop _merge


merge 1:1 year using `rich_share'
drop _merge

merge 1:1 year using `college'
drop _merge


merge 1:1 year using `rental_ratio'
drop _merge

merge 1:1 year using `Dis_T20_nbd'
drop _merge

merge 1:1 year using `gini'
drop _merge

merge 1:1 year using `p25p75_ratio'
drop _merge 




** Label all the variables 
label variable Size_A "Neighborhood A Size"
label variable Size_B "Neighborhood A Size"
label variable Size_C "Neighborhood A Size"

label variable ShareOfRich_A "Share of A's population that are rich"
label variable ShareOfRich_B "Share of B's population that are rich"

label variable College_A "Share of A's population with college"
label variable College_B "Share of B's population with college"

label variable HouseAB "Rent Ration A/B"
label variable HouseBC "Rent Ratio B/C"

label variable Dis_T20_nbd "Dissim Index, Neighborhood (Top 20% Rich)" 

label variable gini "Gini Index"
label variable p25_p75_ratio "Income Ration p25/p75"


rename * value_*
rename value_year year
reshape long value_, i(year) j(variable) string
reshape wide value_, i(variable) j(year)

gen var_order = . 
replace var_order=1 if variable=="gini"
replace var_order=2 if variable=="Dis_T20_nbd"
replace var_order=3 if variable=="p25_p75_ratio"
replace var_order=4 if variable=="Size_A"
replace var_order=5 if variable=="Size_B"
replace var_order=6 if variable=="Size_C"
replace var_order=7 if variable=="ShareOfRich_A"
replace var_order=8 if variable=="ShareOfRich_B"
replace var_order=9 if variable=="HouseAB"
replace var_order=10 if variable=="HouseBC"
replace var_order=11 if variable=="College_A"
replace var_order=12 if variable=="College_B"

sort var_order
drop var_order


** Rename the variable as they appaear in the paper 
replace variable="Gini" if variable=="gini"
replace variable="Dissimilarity" if variable=="Dis_T20_nbd"
replace variable="Income Ratio p25/p75" if variable=="p25_p75_ratio"
replace variable="Size A" if variable=="Size_A"
replace variable="Size B" if variable=="Size_B"
replace variable="Size C" if variable=="Size_C"
replace variable="Share of Rich A" if variable=="ShareOfRich_A"
replace variable="Share of Rich B" if variable=="ShareOfRich_B"
replace variable="Rent A/B" if variable=="HouseAB"
replace variable="Rent B/C" if variable=="HouseBC"
replace variable="College share A" if variable=="College_A"
replace variable="College share B" if variable=="College_B"


export excel using "$output/calibration_targets.xlsx", firstrow(variables) replace
